Translational Bioinformatics
China races to build record biobank to rival U.S. drugs research
Biobanks store masses of biomedical data such as clinical records, genome sequences and other long-term health metrics that research and drug development depend on. As a fledgling researcher in the U.S., Zhang Li was struck by the efficiency of extracting human tissue in the morning and mining it for data the same afternoon. Such a streamlined process had been missing from his years of training as a bio data scientist in China. Inspired, he returned home to Beijing to join the Chinese Institute for Brain Research and launch a national database that will collect blood and DNA samples from 33,000 children to help identify patterns of brain disease and their risk factors. "Biomedical data is extremely valuable and is fundamental for us to find solutions to diseases and to delay aging," said Zhang, surrounded by robotic arms carefully organizing blood samples.
- Information Technology > Communications > Social Media (0.78)
- Information Technology > Artificial Intelligence > Robots (0.71)
- Information Technology > Biomedical Informatics > Translational Bioinformatics (0.56)
Biconvex Biclustering
Rosen, Sam, Chi, Eric C., Xu, Jason
This article proposes a biconvex modification to convex biclustering in order to improve its performance in high-dimensional settings. In contrast to heuristics that discard a subset of noisy features a priori, our method jointly learns and accordingly weighs informative features while discovering biclusters. Moreover, the method is adaptive to the data, and is accompanied by an efficient algorithm based on proximal alternating minimization, complete with detailed guidance on hyperparameter tuning and efficient solutions to optimization subproblems. These contributions are theoretically grounded; we establish finite-sample bounds on the objective function under sub-Gaussian errors, and generalize these guarantees to cases where input affinities need not be uniform. Extensive simulation results reveal our method consistently recovers underlying biclusters while weighing and selecting features appropriately, outperforming peer methods. An application to a gene microarray dataset of lymphoma samples recovers biclusters matching an underlying classification, while giving additional interpretation to the mRNA samples via the column groupings and fitted weights.
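For orientation, the convex biclustering problem that the abstract builds on is usually written in the literature as a fused-penalty problem over a matrix $U$ of the same shape as the data $X$. The block below is that standard formulation as commonly stated, not the biconvex objective proposed in this article, which the abstract does not spell out; the symbols $\gamma$, $w_{ij}$ and $\tilde{w}_{kl}$ are notation assumed here.

```latex
% Standard convex biclustering objective (background, assumed notation):
%   X : data matrix, U : fitted matrix with the same shape as X
%   U_{\cdot i} : i-th column of U,  U_{k \cdot} : k-th row of U
%   w_{ij}, \tilde{w}_{kl} : nonnegative column and row affinities
%   \gamma \ge 0 : regularization strength
\min_{U} \; \frac{1}{2} \lVert X - U \rVert_F^2
  + \gamma \Bigl[ \sum_{i<j} w_{ij} \lVert U_{\cdot i} - U_{\cdot j} \rVert_2
  + \sum_{k<l} \tilde{w}_{kl} \lVert U_{k \cdot} - U_{l \cdot} \rVert_2 \Bigr]
```

The two penalty sums pull columns and rows of $U$ toward each other, so that larger $\gamma$ fuses them into fewer biclusters. Every feature enters the squared-error term equally here, which is the behavior the abstract's learned feature weights are meant to change.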
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- North America > United States > Minnesota (0.04)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Health & Medicine > Therapeutic Area > Oncology (0.66)
- Information Technology > Biomedical Informatics > Translational Bioinformatics (0.86)
- Information Technology > Data Science (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)
- North America > United States > California (0.04)
- Europe > France (0.04)
- Asia > Middle East > Jordan (0.04)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Government > Regional Government > North America Government > United States Government (0.93)
- Materials > Chemicals (0.93)
- Information Technology (0.67)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
- Information Technology > Biomedical Informatics > Translational Bioinformatics (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Appendix ProteinShake: Building datasets and benchmarks for deep learning on protein structures
Table 3: Comparison of models trained with different representations of protein structure across various tasks, on a random data split. The optimal choice of representation depends on the task. Shown are mean and standard deviation across four runs with different seeds. Table 4: Comparison of models trained with different representations of protein structure across various tasks, on a sequence data split. Table 5: Comparison of models trained with different representations of protein structure across various tasks, on a structure data split.
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.52)
- Information Technology > Biomedical Informatics > Translational Bioinformatics (0.41)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- North America > Canada (0.04)
- Research Report > Experimental Study (0.68)
- Research Report > New Finding (0.68)
- Information Technology > Biomedical Informatics > Translational Bioinformatics (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Vision (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
- Overview (0.48)
- Workflow (0.47)
- Research Report (0.46)
- Information Technology > Biomedical Informatics > Translational Bioinformatics (0.96)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)
Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA
Foundation models have made significant strides in understanding the genomic language of DNA sequences. However, previous models typically adopt tokenization methods designed for natural language, which are unsuitable for DNA sequences due to their unique characteristics. In addition, the optimal approach to tokenizing DNA remains largely under-explored, and may not be intuitively understood by humans even if discovered. To address these challenges, we introduce MxDNA, a novel framework where the model autonomously learns an effective DNA tokenization strategy through gradient descent.
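To make the contrast concrete, here is a minimal sketch of a fixed, hand-designed scheme of the kind the abstract argues against: splitting a sequence into k-mers. This is not MxDNA's method, and the abstract does not say which fixed schemes it has in mind; the function name `kmer_tokenize` and its parameters are hypothetical, chosen only for illustration.

```python
def kmer_tokenize(sequence, k=3, stride=None):
    """Split a DNA string into fixed-length k-mers (illustrative baseline only).

    stride=None gives non-overlapping k-mers (stride == k);
    stride=1 gives fully overlapping k-mers.
    Trailing bases that do not fill a whole k-mer are dropped.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    step = k if stride is None else stride
    if step <= 0:
        raise ValueError("stride must be positive")
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


non_overlapping = kmer_tokenize("ACGTAC", k=3)        # ['ACG', 'TAC']
overlapping = kmer_tokenize("ACGTAC", k=3, stride=1)  # ['ACG', 'CGT', 'GTA', 'TAC']
```

The token boundaries here are set by `k` and `stride` before any training happens, whereas the abstract describes boundaries that are learned by the model.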
- Information Technology > Biomedical Informatics > Translational Bioinformatics (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Efficient and Effective Optimal Transport-Based Biclustering: Supplementary Material
Z that represents some transfer of mass between elements of w and v. The proof is the same for W. Proposition 2. Suppose that the target row and column representative distributions are the same, The the Kantorovich OT problem and whose rank is at most min(rank(Z), rank(W)). Proof of Proposition 2. From linear algebra, we have that Proof of Proposition 3. We suppose that The optimal transport problem can be formulated and solved as the Earth Mover's Distance (EMD) We report the biclustering performance on the synthetic datasets in Table 2. At least one of our models finds the perfect partition in all cases. The gene-expression matrices used are the Cumida Breast Cancer and Leukemia datasets. Their characteristics are shown in Table 3. Table 3: Characteristics of the gene expression datasets.
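As a small illustration of a coupling Z that "represents some transfer of mass between elements of w and v", the sketch below builds a transport plan whose row sums equal w and whose column sums equal v, then evaluates its cost. It is a toy under stated assumptions, not the solver used in the supplementary material: the plan comes from the north-west corner rule, which is optimal only for special costs such as |i − j| on an ordered line, and the names `northwest_corner_plan` and `transport_cost` are hypothetical.

```python
def northwest_corner_plan(w, v, tol=1e-12):
    """Return a coupling Z with row sums w and column sums v.

    Uses the north-west corner rule: feasible for any cost, but optimal
    only for special costs (e.g. |i - j| on sorted 1-D supports).
    """
    if abs(sum(w) - sum(v)) > 1e-9:
        raise ValueError("w and v must carry the same total mass")
    rem_w, rem_v = list(w), list(v)  # mass still to be shipped / received
    Z = [[0.0] * len(rem_v) for _ in rem_w]
    i = j = 0
    while i < len(rem_w) and j < len(rem_v):
        moved = min(rem_w[i], rem_v[j])
        Z[i][j] += moved
        rem_w[i] -= moved
        rem_v[j] -= moved
        if rem_w[i] <= tol:   # source i is exhausted
            i += 1
        if rem_v[j] <= tol:   # target j is full
            j += 1
    return Z


def transport_cost(Z, C):
    """Total cost sum_ij Z[i][j] * C[i][j] of plan Z under cost matrix C."""
    return sum(z * c for z_row, c_row in zip(Z, C) for z, c in zip(z_row, c_row))


w = [0.5, 0.5]
v = [0.25, 0.75]
C = [[abs(i - j) for j in range(len(v))] for i in range(len(w))]
Z = northwest_corner_plan(w, v)
cost = transport_cost(Z, C)  # 0.25 for this toy example
```

For general cost matrices the Kantorovich problem is a linear program over all such couplings, so a dedicated LP or OT solver would be needed in place of the north-west corner rule.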
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Health & Medicine > Therapeutic Area > Oncology (0.71)
- Information Technology > Artificial Intelligence > Machine Learning (0.71)
- Information Technology > Biomedical Informatics > Translational Bioinformatics (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.49)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > Canada (0.04)
- Africa > Middle East > Egypt > Cairo Governorate > Cairo (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)
- Information Technology > Biomedical Informatics > Translational Bioinformatics (0.68)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Health & Medicine > Therapeutic Area > Oncology (0.93)
- Information Technology > Biomedical Informatics > Translational Bioinformatics (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)